HGC-Herd: Efficient Heterogeneous Graph Condensation via Representative Node Herding

Ou, Fuyan, Ai, Siqi, Hu, Yulin

arXiv.org Artificial Intelligence

Heterogeneous graph neural networks (HGNNs) have demonstrated strong capability in modeling complex semantics across multi-type nodes and relations. However, their scalability to large-scale graphs remains challenging due to structural redundancy and high-dimensional node features. Existing graph condensation approaches, such as GCond, are primarily developed for homogeneous graphs and rely on gradient matching, resulting in considerable computational, memory, and optimization overhead. We propose HGC-Herd, a training-free condensation framework that generates compact yet informative heterogeneous graphs while maintaining both semantic and structural fidelity. HGC-Herd integrates lightweight feature propagation to encode multi-hop relational context and employs a class-wise herding mechanism to identify representative nodes per class, producing balanced and discriminative subsets for downstream learning tasks. Extensive experiments on ACM, DBLP, and Freebase validate that HGC-Herd attains comparable or superior accuracy to full-graph training while markedly reducing both runtime and memory consumption. These results underscore its practical value for efficient and scalable heterogeneous graph representation learning.
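
The condensation step described above lends itself to a short illustration. The sketch below shows parameter-free multi-hop feature propagation followed by class-wise herding selection; it is a minimal reading of the abstract, and names such as propagate, herding_select, the 2-hop default, and the per-class call pattern are choices made for this example rather than details from the paper.

import numpy as np

def propagate(adj_norm, X, hops=2):
    # Parameter-free multi-hop smoothing: X <- A_hat @ X, applied `hops` times,
    # to fold relational context into node features before selection.
    for _ in range(hops):
        X = adj_norm @ X
    return X

def herding_select(features, budget):
    # Greedily pick `budget` nodes whose running mean best tracks the class centroid.
    mu = features.mean(axis=0)
    selected = []
    running_sum = np.zeros_like(mu)
    for k in range(1, budget + 1):
        candidate_means = (running_sum + features) / k   # mean if each node were added next
        dists = np.linalg.norm(candidate_means - mu, axis=1)
        dists[selected] = np.inf                         # never pick the same node twice
        idx = int(np.argmin(dists))
        selected.append(idx)
        running_sum += features[idx]
    return selected

# Per class: condensed = herding_select(propagate(A_hat, X)[class_mask], budget)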



Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities

Latifi, Payam

arXiv.org Artificial Intelligence

This pilot study presents a small-scale but carefully annotated benchmark of Named Entity Recognition (NER) performance across six systems: three non-LLM NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models (LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119 tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME). We evaluated each system's output against the manually annotated gold-standard dataset using F1-score. The results show that LLMs generally outperform conventional tools in recognizing context-sensitive entities such as person names, with Gemini achieving the highest average F1-score. However, traditional systems like Stanza demonstrate greater consistency on structured tags such as LOCATION and DATE. We also observed variability among LLMs, particularly in handling temporal expressions and multi-word organizations. Our findings highlight that while LLMs offer improved contextual understanding, traditional tools remain competitive on specific tasks, a result that can inform model selection.
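
Since the systems are compared by F1 against a gold standard, a minimal sketch of entity-level scoring may help make the evaluation concrete; representing entities as (start, end, type) spans is an assumption of this example, not a detail reported in the abstract.

def f1_score(gold, predicted):
    # Entity-level F1: a prediction counts only if span and type match exactly.
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Example: one PERSON matched, one LOCATION missed, one spurious DATE.
gold = {(0, 1, "PERSON"), (5, 6, "LOCATION")}
pred = {(0, 1, "PERSON"), (9, 10, "DATE")}
print(f1_score(gold, pred))  # 0.5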


Investigating the Impact of Observation Space Design Choices On Training Reinforcement Learning Solutions for Spacecraft Problems

Hamilton, Nathaniel, Dunlap, Kyle, Hobbs, Kerianne L

arXiv.org Artificial Intelligence

Recent research using Reinforcement Learning (RL) to learn autonomous control for spacecraft operations has shown great success. However, a recent study showed their performance could be improved by changing the action space, i.e. control outputs, used in the learning environment. This has opened the door for finding more improvements through further changes to the environment. The work in this paper focuses on how changes to the environment's observation space can impact the training and performance of RL agents learning the spacecraft inspection task. The studies are split into two groups. The first looks at the impact of sensors that were designed to help agents learn the task. The second looks at the impact of reference frames, reorienting the agent to see the world from a different perspective. The results show the sensors are not necessary, but most of them help agents learn more optimal behavior, and that the reference frame does not have a large impact, but is best kept consistent.
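
As an illustration of how a reference-frame change can be isolated from the rest of the environment, the sketch below wraps a Gymnasium-style environment and rotates position and velocity observations into another frame. The flat [x, y, z, vx, vy, vz] layout and the rotation_fn callback are assumptions of this example, not details of the environment used in the paper.

import gymnasium as gym
import numpy as np

class ReferenceFrameWrapper(gym.ObservationWrapper):
    # Rotates the first six observation components (position, velocity) into a
    # different reference frame without touching the underlying dynamics.
    def __init__(self, env, rotation_fn):
        super().__init__(env)
        self.rotation_fn = rotation_fn  # returns a 3x3 rotation matrix each step

    def observation(self, obs):
        R = self.rotation_fn(self.env)
        pos, vel = obs[:3], obs[3:6]
        return np.concatenate([R @ pos, R @ vel, obs[6:]])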


A study on the effects of mixed explicit and implicit communications in human-virtual-agent interactions

Campos, Ana Christina Almada, Adorno, Bruno Vilhena

arXiv.org Artificial Intelligence

Communication between humans and robots (or virtual agents) is essential for interaction and is often inspired by human communication, which uses gestures, facial expressions, gaze direction, and other explicit and implicit means. This work presents an interaction experiment in which humans and virtual agents interact through explicit (gestures, manual entries using mouse and keyboard, voice, sound, and information on screen) and implicit (gaze direction, location, facial expressions, and eyebrow raises) communication to evaluate the effect of mixed explicit-implicit communication against purely explicit communication. Results obtained using Bayesian parameter estimation show that the number of errors and task execution time did not change significantly when mixed explicit and implicit communications were used, nor did the perceived efficiency of the interaction. In contrast, acceptance, sociability, and transparency of the virtual agent increased when using mixed communication modalities (88.3%, 92%, and 92.9% of the effect-size posterior distribution of each variable, respectively, lay above the upper limit of the region of practical equivalence). This suggests that task-related measures, such as time, number of errors, and perceived efficiency of the interaction, were not influenced by the communication type in our particular experiment. However, the improvement in subjective measures related to the virtual agent, such as acceptance, sociability, and transparency, suggests that humans are more receptive to mixed explicit and implicit communications.
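
The reported percentages correspond to the share of the effect-size posterior lying above the region of practical equivalence (ROPE). A minimal sketch of that check, operating on posterior samples from any Bayesian estimation routine, is shown below; the simulated posterior and the 0.1 ROPE limit are illustrative values, not the study's.

import numpy as np

def fraction_above_rope(effect_samples, rope_upper):
    # Share of posterior mass indicating an effect larger than the ROPE allows.
    return float(np.mean(effect_samples > rope_upper))

# Illustrative posterior over an effect size (not data from the study).
posterior = np.random.default_rng(0).normal(loc=0.3, scale=0.15, size=10_000)
print(fraction_above_rope(posterior, rope_upper=0.1))  # roughly 0.9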


What is the best RNN-cell structure to forecast each time series behavior?

Khaldi, Rohaifa, Afia, Abdellatif El, Chiheb, Raddouane, Tabik, Siham

arXiv.org Artificial Intelligence

It is unquestionable that time series forecasting is of paramount importance in many fields. The most used machine learning models to address time series forecasting tasks are Recurrent Neural Networks (RNNs). Typically, those models are built using one of the three most popular cells: ELMAN, Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU) cells. Each cell has a different structure and implies a different computational cost. However, it is not clear why and when to use each RNN-cell structure. Actually, there is no comprehensive characterization of all the possible time series behaviors and no guidance on which RNN-cell structure is the most suitable for each behavior. The objective of this study is twofold: it presents a comprehensive taxonomy of almost all time series behaviors and provides insights into the best RNN-cell structure for each time series behavior. We conducted two experiments: (1) We evaluate and analyze the role of each component in the LSTM-Vanilla cell by creating 11 variants based on one alteration to its basic architecture (removing, adding, or substituting one cell component). (2) We evaluate and analyze the performance of 20 possible RNN-cell structures. To evaluate, compare, and select the best model, different statistical metrics were used: error-based metrics, information criterion-based metrics, naive-based metrics, and direction change-based metrics. To further improve our confidence in model interpretation and selection, the Friedman Wilcoxon-Holm signed-rank test was used. Our results advocate the usage and exploration of the newly created RNN variant, named SLIM, in time series forecasting thanks to its high ability to accurately predict the different time series behaviors, as well as its simple structural design that does not require expensive temporal and computing resources.
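
To make the "one alteration per variant" idea concrete, the sketch below implements an LSTM cell with its forget gate removed, so the previous cell state is always kept. This is only one plausible variant of the kind described; it is not the SLIM cell, whose exact structure is not given in the abstract.

import torch
import torch.nn as nn

class LSTMCellNoForget(nn.Module):
    # Standard LSTM cell with the forget gate deleted: c_t = c_{t-1} + i_t * g_t.
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 3 * hidden_size)

    def forward(self, x, state):
        h, c = state
        i, g, o = self.gates(torch.cat([x, h], dim=-1)).chunk(3, dim=-1)
        c_new = c + torch.sigmoid(i) * torch.tanh(g)   # no forgetting of the old state
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, (h_new, c_new)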


Can Vision-Language Models Think from a First-Person Perspective?

Cheng, Sijie, Guo, Zhicheng, Wu, Jingwen, Fang, Kechen, Li, Peng, Liu, Huaping, Liu, Yang

arXiv.org Artificial Intelligence

Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
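
The single-answer grading step can be sketched as a thin wrapper around whatever judge model is available. In the snippet below, call_judge stands in for a GPT-4 client, and the prompt wording and 0/0.5/1 scale are assumptions of this example rather than the benchmark's exact protocol.

def grade_answer(question, reference, prediction, call_judge):
    # Ask the judge model for a single numeric grade and parse it.
    prompt = (
        "Rate the predicted answer against the reference on a 0 / 0.5 / 1 scale.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}\n"
        "Reply with only the number."
    )
    return float(call_judge(prompt).strip())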


PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

Cao, Qingqing, Paranjape, Bhargavi, Hajishirzi, Hannaneh

arXiv.org Artificial Intelligence

Large-scale vision language (VL) models use Transformers to perform cross-modal interactions between the input text and image. These cross-modal interactions are computationally expensive and memory-intensive due to the quadratic complexity of processing the input image and text. We present PuMer: a token reduction framework that uses text-informed Pruning and modality-aware Merging strategies to progressively reduce the tokens of input image and text, improving model inference speed and reducing memory footprint. PuMer learns to keep salient image tokens related to the input text and merges similar textual and visual tokens by adding lightweight token reducer modules at several cross-modal layers in the VL model. Training PuMer is mostly the same as finetuning the original VL model but faster. Our evaluation for two vision language models on four downstream VL tasks shows PuMer increases inference throughput by up to 2x and reduces memory footprint by over 50% while incurring less than a 1% accuracy drop.
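
The two reduction strategies can be pictured with a small sketch: score image tokens by similarity to the text and keep the top fraction, then merge near-duplicate tokens by averaging. The keep ratio, similarity threshold, and function names are illustrative; the paper's learned, text-informed token reducers are more involved than this.

import torch
import torch.nn.functional as F

def prune_image_tokens(img_tokens, txt_tokens, keep_ratio=0.5):
    # Keep the image tokens most similar to any text token (text-informed pruning).
    sim = F.normalize(img_tokens, dim=-1) @ F.normalize(txt_tokens, dim=-1).T
    scores = sim.max(dim=-1).values
    k = max(1, int(img_tokens.shape[0] * keep_ratio))
    return img_tokens[scores.topk(k).indices]

def merge_similar_tokens(tokens, threshold=0.95):
    # Group tokens whose cosine similarity exceeds the threshold and average each group.
    normed = F.normalize(tokens, dim=-1)
    used = torch.zeros(tokens.shape[0], dtype=torch.bool)
    merged = []
    for i in range(tokens.shape[0]):
        if used[i]:
            continue
        group = (normed @ normed[i] > threshold) & ~used
        used |= group
        merged.append(tokens[group].mean(dim=0))
    return torch.stack(merged)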


MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks

Bara, Cristian-Paul, CH-Wang, Sky, Chai, Joyce

arXiv.org Artificial Intelligence

An ideal integration of autonomous agents in a human world implies that they are able to collaborate on human terms. In particular, theory of mind plays an important role in maintaining common ground during human collaboration and communication. To enable theory of mind modeling in situated interactions, we introduce a fine-grained dataset of collaborative tasks performed by pairs of human subjects in the 3D virtual blocks world of Minecraft. It provides information that captures partners' beliefs of the world and of each other as an interaction unfolds, bringing abundant opportunities to study human collaborative behaviors in situated language communication. As a first step towards our goal of developing embodied AI agents able to infer belief states of collaborative partners in situ, we build and present results on computational models for several theory of mind tasks.


The Multi-Modal Video Reasoning and Analyzing Competition

Peng, Haoran, Huang, He, Xu, Li, Li, Tianjiao, Liu, Jun, Rahmani, Hossein, Ke, Qiuhong, Guo, Zhicheng, Wu, Cong, Li, Rongchang, Ye, Mang, Wang, Jiahao, Zhang, Jiaxu, Liu, Yuanzhong, He, Tao, Zhang, Fuwei, Liu, Xianbin, Lin, Tao

arXiv.org Artificial Intelligence

In this paper, we introduce the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) workshop in conjunction with ICCV 2021. This competition is composed of four different tracks, namely, video question answering, skeleton-based action recognition, fisheye video-based action recognition, and person re-identification, which are based on two datasets: SUTD-TrafficQA and UAV-Human. We summarize the top-performing methods submitted by the participants in this competition and show their results achieved in the competition.